3.3 Q4 Term Frequency-Inverse Document Frequency (TF-IDF)
To compute TF-IDF, we ran a Spark NLP processing job with the HashingTF feature transformer, chosen because of the large dataset size. The challenge with Spark's HashingTF is that it hashes each word into a fixed-size feature vector. This hashing makes the transformer efficient, but it discards the direct mapping between words and their vector indices, which makes it difficult to recover the original words from the indices afterward.
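The hashing trick described above can be illustrated with a small, self-contained sketch. This is not Spark code: CRC32 stands in for Spark's MurmurHash3, `NUM_FEATURES` is deliberately tiny (Spark's HashingTF defaults to 2^18), and the smoothed IDF formula mirrors the common log((N+1)/(df+1)) form. The point it demonstrates is the one in the text: the vector index is a one-way hash of the word, so the index alone cannot be mapped back to the term.

```python
import math
import zlib

NUM_FEATURES = 16  # tiny for illustration; Spark's HashingTF defaults to 2**18


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to a fixed-size term-frequency vector via hashing.

    The word -> index mapping is one-way: given only an index, the
    original word cannot be recovered (the challenge noted in the text).
    """
    vec = [0.0] * num_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec


def idf(docs_tf, num_features=NUM_FEATURES):
    """Smoothed inverse document frequency per feature index:
    log((N + 1) / (df + 1)), where df counts docs with a nonzero hit."""
    n = len(docs_tf)
    df = [sum(1 for v in docs_tf if v[i] > 0) for i in range(num_features)]
    return [math.log((n + 1) / (d + 1)) for d in df]


# Toy corpus using terms from the table below.
docs = [["blockchain", "burning"], ["blockchain", "career"], ["buffet"]]
tfs = [hashing_tf(d) for d in docs]
weights = idf(tfs)
tfidf = [[tf_i * w for tf_i, w in zip(vec, weights)] for vec in tfs]
```

In practice, one workaround for the lost mapping is to hash the known vocabulary forward and build an index-to-words lookup table (collisions permitting), or to use a vocabulary-preserving transformer such as Spark's CountVectorizer instead of HashingTF.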
Table 2: Top 10 Words by TF-IDF Scoring: Highlighting Unique Vocabulary
| Term | TF-IDF Score |
|------|--------------|
| blockchain | 2159.020063 |
| burning | 1217.472828 |
| adventures | 988.120861 |
| above | 969.387716 |
| buffet | 968.732854 |
| are | 927.111189 |
| ceoofdogecoin | 870.622154 |
| 240k | 827.099049 |
| announces | 817.186108 |
| career | 780.541266 |
(TF-IDF scores computed from a sample of the dataset)